Learning Objectives

  • Further your understanding of R and the R Studio interface
  • Learn the basics in data manipulation using the dplyr package
  • Learn to use help() to look up function documentation
  • Run code from a script

Introduction to Data Management

Now that you’ve learned the basics of R programming, we’ll take things a step further and start working on your skills related to data analysis. You will likely be unfamiliar with some of the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available so that you can become an effective problem solver, working independently on data analyses.

Running Code in a Script

So far, we’ve run code in the Console. This is fine for quick queries. For anything to be shareable, reproducible, and to reduce our own efforts in the long-term, we should run code via a script. An R script is a just a simple text file. R-Studio uses the script by copying R commands from text in the file and pastes them into the Console as if you were manually entering the commands yourself. This greatly enhances our ability to build off what we have created in the past, learn from previous experience, and quickly re-run analyses when new data are received. To create an R script:

  1. From the “file” mean, select “New File”
  2. Click “R Script from the list of options

R-Studio will open your R script automatically after creating it. Notice that the scripting window appears above the Console in what’s known as the Source pane.

  1. Save the R script in your working directly in the Scripts folder. Name this file IntroDataMgmt.R.

Figure 1: Create a script for running analyses
At the top of your script, provide brief information that describes the content of your script. The content is up to you, but should briefly identify who created the file, when the file was created, and what the script does. This will help when you return to the file at a later date or if you decide to share the file with a colleague in the future. Remember that anything after the # symbol is a comment. Use this symbol to make your code more readable, similar to below.

# ******************************************************************
# ******************************************************************

# Project: Introduction to Data Analysis in R
# Description: A script which details some basic commands on how to manipulate data
# Author: <Your Name>
# Date Initialized: <dd month yyyy>

# ******************************************************************
# ******************************************************************

Running an R Script

Running code via an R script is different than running code in the Console. To interpret and run the code you’ve written, R needs you to send the code from the script to the Console. Some common ways to run your code include:

  1. Highlight the line of code you want to run and use the shortcut Ctrl + Enter (Windows) or command + return (Mac).
  2. Highlight your code a physically click the Run button in the top right of the Source pane with your mouse.

Clearing your Workspace

You’ll find that your Environment (Workspace) in the upper right panel will quickly become full with user-defined objects. It’s generally good practice to work with a clean Workspace when starting a session. I generally start all my scripts with the following command to make sure you are starting fresh, something we will do to help develop good programming practices and reduce clutter.

# Clean your workspace/remove all objects
rm(list=ls())

# You can also remove a specific dataset using the following command
#rm(dataset)

Data Table Manipulation with Dplyr

The most basic R skill is to query and manipulate data tables. As a beginner programmer, it is imperative to familiarize yourself with how to manipulate data. Reinforcing these skills is like expanding your vocabulary in the new language that you are learning and is a great way to improve your R proficiency. If you wish to become really good at R, but don’t know where to start, start with data table manipulation!

The base R functions that come with the default R installation have the capacity for almost all the table manipulation needs (e.g., split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()). However, sometimes their syntax are less user-friendly and intuitive than some of the special packages built for table manipulation purposes. So, here we are introducing a few of the most useful table manipulation functions within dplyr package.

Note that you will have to use install.packages() and library() function to download and activate the dplyr before using it. You only need to install the package once on your computer. You will need, however, to ‘activate’ the package any time you want to use the functions that exist within the package.

#install.packages("dplyr")
library(dplyr)

Reading/Importing Data

R has multiple functions for reading in table data. Here we’ll use the base function read.csv() to import a table named panda_data.csv that is located in your Data folder. Text files (.txt) can be imported using the function read.delim(). See the help files for each function and search Google for information on other functions to read other data types.

View the first few rows of the data table using the function head() or click on the dataframe in the Environment/History panel.

# Read dataset
panda_data <- read.csv(file="Data/panda_data.csv")

# Look at the data
head(panda_data)
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1      da_da 4415463   5       100   m   CD             99             50
## 2    mao_mao 4415522   4       120   f   CD             70             30
## 3    lan_lan 4416073   7        95   f   WL             80             NA
## 4    bei_bei 4416405   5       120   f   WL             80             NA
## 5    bao_bao 4417779   5       110   m   WL             75             60
## 6  tian_tian 4424490   4        89   m   WL             95             70
##   genetic_value3 year
## 1             60 2017
## 2             90 2017
## 3             80 2017
## 4             65 2017
## 5             76 2017
## 6             87 2017

Questions:

  1. Can you change the number of rows that are displayed with the head() function?
  2. How would you print the first 3 rows?
  3. How would you print rows 4 thru 6?
  4. What are the dimensions of the dataframe? How many rows and columns are there?
  5. Can you guess how you might look at the last few rows of the dataframe?

Selecting Columns

The function select() is a powerful tool for selecting columns of interest. You must specify the dataset you want to query and then also provide an expression for selecting columns of interest (select(.data, expression)). A few examples are provided below:

# select column called panda_name
select(panda_data, panda_name) 
##       panda_name
## 1          da_da
## 2        mao_mao
## 3        lan_lan
## 4        bei_bei
## 5        bao_bao
## 6      tian_tian
## 7        wei_wei
## 8  shuang_shuang
## 9         qiuqiu
## 10     lang_lang
# select all columns in the data except panda_name
select(panda_data, -panda_name)
##         ID age weight_kg sex base genetic_value1 genetic_value2 genetic_value3
## 1  4415463   5       100   m   CD             99             50             60
## 2  4415522   4       120   f   CD             70             30             90
## 3  4416073   7        95   f   WL             80             NA             80
## 4  4416405   5       120   f   WL             80             NA             65
## 5  4417779   5       110   m   WL             75             60             76
## 6  4424490   4        89   m   WL             95             70             87
## 7  4424967   7        98   f   CD             60             NA             92
## 8  4424657   8       110   m   WL             92             NA             50
## 9  4423758   9       130   f   CD             81             50             30
## 10 4427758  10       110   m   WL             60             NA             70
##    year
## 1  2017
## 2  2017
## 3  2017
## 4  2017
## 5  2017
## 6  2017
## 7  2017
## 8  2017
## 9  2018
## 10 2018
# select a range of columns, from age to sex
select(panda_data, age:sex)
##    age weight_kg sex
## 1    5       100   m
## 2    4       120   f
## 3    7        95   f
## 4    5       120   f
## 5    5       110   m
## 6    4        89   m
## 7    7        98   f
## 8    8       110   m
## 9    9       130   f
## 10  10       110   m

Various selection helpers also exist, including:

  • starts_with: Expression select multiple columns that start with the same text.
  • ends_with(): Expression to select columns that end with the same text.
  • contains(): Expression to select columns that contain the same text.
  • matches(): Expression to select columns that match a regular expression.
  • one_of(): Expression to select columns that are from a group of names.
# select all columns that start with "genetic" in their column names
select(panda_data, starts_with("genetic")) 
##    genetic_value1 genetic_value2 genetic_value3
## 1              99             50             60
## 2              70             30             90
## 3              80             NA             80
## 4              80             NA             65
## 5              75             60             76
## 6              95             70             87
## 7              60             NA             92
## 8              92             NA             50
## 9              81             50             30
## 10             60             NA             70

Questions:

  1. Select all the columns that contain “value” in the column name.
  2. Select or exclude two columns: panda_name and age.

Filtering Data

Filter() is similar to select(), except that you are selecting specific rows that satisfy a requirement based on a column value. This function is very similar to the base function subset(). A few examples on how to use filter() are provided below:

Artwork by Allison Horst
Artwork by Allison Horst


# Select rows where pandas are greater than or equal to 5 years of age
filter(panda_data, age >= 5)
##      panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1         da_da 4415463   5       100   m   CD             99             50
## 2       lan_lan 4416073   7        95   f   WL             80             NA
## 3       bei_bei 4416405   5       120   f   WL             80             NA
## 4       bao_bao 4417779   5       110   m   WL             75             60
## 5       wei_wei 4424967   7        98   f   CD             60             NA
## 6 shuang_shuang 4424657   8       110   m   WL             92             NA
## 7        qiuqiu 4423758   9       130   f   CD             81             50
## 8     lang_lang 4427758  10       110   m   WL             60             NA
##   genetic_value3 year
## 1             60 2017
## 2             80 2017
## 3             65 2017
## 4             76 2017
## 5             92 2017
## 6             50 2017
## 7             30 2018
## 8             70 2018
# select rows that have age>5 OR weight_kg >100 
filter(panda_data, age > 5 | weight_kg > 100)
##      panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1       mao_mao 4415522   4       120   f   CD             70             30
## 2       lan_lan 4416073   7        95   f   WL             80             NA
## 3       bei_bei 4416405   5       120   f   WL             80             NA
## 4       bao_bao 4417779   5       110   m   WL             75             60
## 5       wei_wei 4424967   7        98   f   CD             60             NA
## 6 shuang_shuang 4424657   8       110   m   WL             92             NA
## 7        qiuqiu 4423758   9       130   f   CD             81             50
## 8     lang_lang 4427758  10       110   m   WL             60             NA
##   genetic_value3 year
## 1             90 2017
## 2             80 2017
## 3             65 2017
## 4             76 2017
## 5             92 2017
## 6             50 2017
## 7             30 2018
## 8             70 2018
# select rows that have age>5 AND base column has CD has entry
filter(panda_data, age > 5 & base == "CD") 
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1    wei_wei 4424967   7        98   f   CD             60             NA
## 2     qiuqiu 4423758   9       130   f   CD             81             50
##   genetic_value3 year
## 1             92 2017
## 2             30 2018
# Select rows where the panda age is defined by a few values.  Note the use of the %in% function.
filter(panda_data, age %in% c(4,5,7))
##   panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1      da_da 4415463   5       100   m   CD             99             50
## 2    mao_mao 4415522   4       120   f   CD             70             30
## 3    lan_lan 4416073   7        95   f   WL             80             NA
## 4    bei_bei 4416405   5       120   f   WL             80             NA
## 5    bao_bao 4417779   5       110   m   WL             75             60
## 6  tian_tian 4424490   4        89   m   WL             95             70
## 7    wei_wei 4424967   7        98   f   CD             60             NA
##   genetic_value3 year
## 1             60 2017
## 2             90 2017
## 3             80 2017
## 4             65 2017
## 5             76 2017
## 6             87 2017
## 7             92 2017

Questions:

1: Select rows with NA in the genetic_value2 column.
2: Select rows whose panda_name column are bao_bao or bei_bei.

Using the Pipe Operator

Piping (%>%) allows the user to combine the output from one function to the input of another. Thus, instead of nesting functions (reading from the inside to the outside), piping reads functions from left to right (i.e., the way we normally read things). As a result, reading piped code can be more intuitive and can help avoid creating and saving a lot of intermediate variables that you don’t need.

# Use a simple pipe to select the panda name and it's sex, and output the result.
# Create a new object named 'pipe_result'
pipe_result <- panda_data %>%
  select(panda_name, sex) %>%
  head()

# Output the result to the screen 
pipe_result
##   panda_name sex
## 1      da_da   m
## 2    mao_mao   f
## 3    lan_lan   f
## 4    bei_bei   f
## 5    bao_bao   m
## 6  tian_tian   m

Questions:

  1. Use a pipe operator to select all columns containing “genetic” in their names, but only select rows that have a genetic_value1 >80 AND genetic_value2 <90. Don’t create a new object when you create the query (i.e., just print the result to your screen).
  2. What do you think will happen if you filter based on a field that hasn’t been selected? For instance, select all columns containing “genetic”, but select the rows where animals weigh > 90 kg.

Mutate

One of the most useful functions in dplyr package is mutate(). Mutate allows the user to create new column(s), populating the columns with values that you define or from information in columns that already exist. You can also use mutate() to control which columns are retained in the new object that you create by setting the .keep argument.

Artwork by Allison Horst
Artwork by Allison Horst


# Create a new column, based on values from other columns that exist
# By default, keep = all
new_col_ex1 <- panda_data %>%
  mutate(genetic_value_new = genetic_value1 - genetic_value2,
         .keep = "all")  
# Print
new_col_ex1
##       panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1          da_da 4415463   5       100   m   CD             99             50
## 2        mao_mao 4415522   4       120   f   CD             70             30
## 3        lan_lan 4416073   7        95   f   WL             80             NA
## 4        bei_bei 4416405   5       120   f   WL             80             NA
## 5        bao_bao 4417779   5       110   m   WL             75             60
## 6      tian_tian 4424490   4        89   m   WL             95             70
## 7        wei_wei 4424967   7        98   f   CD             60             NA
## 8  shuang_shuang 4424657   8       110   m   WL             92             NA
## 9         qiuqiu 4423758   9       130   f   CD             81             50
## 10     lang_lang 4427758  10       110   m   WL             60             NA
##    genetic_value3 year genetic_value_new
## 1              60 2017                49
## 2              90 2017                40
## 3              80 2017                NA
## 4              65 2017                NA
## 5              76 2017                15
## 6              87 2017                25
## 7              92 2017                NA
## 8              50 2017                NA
## 9              30 2018                31
## 10             70 2018                NA
# You can create multiple columns at once.  Best to put each new column on a separate line.  This simply makes the code more readable.
new_col_ex2 <- panda_data %>%
    mutate(genetic_dif = genetic_value1 - genetic_value2, 
           weight_g = weight_kg * 1000) 
# Print
new_col_ex2
##       panda_name      ID age weight_kg sex base genetic_value1 genetic_value2
## 1          da_da 4415463   5       100   m   CD             99             50
## 2        mao_mao 4415522   4       120   f   CD             70             30
## 3        lan_lan 4416073   7        95   f   WL             80             NA
## 4        bei_bei 4416405   5       120   f   WL             80             NA
## 5        bao_bao 4417779   5       110   m   WL             75             60
## 6      tian_tian 4424490   4        89   m   WL             95             70
## 7        wei_wei 4424967   7        98   f   CD             60             NA
## 8  shuang_shuang 4424657   8       110   m   WL             92             NA
## 9         qiuqiu 4423758   9       130   f   CD             81             50
## 10     lang_lang 4427758  10       110   m   WL             60             NA
##    genetic_value3 year genetic_dif weight_g
## 1              60 2017          49   100000
## 2              90 2017          40   120000
## 3              80 2017          NA    95000
## 4              65 2017          NA   120000
## 5              76 2017          15   110000
## 6              87 2017          25    89000
## 7              92 2017          NA    98000
## 8              50 2017          NA   110000
## 9              30 2018          31   130000
## 10             70 2018          NA   110000

Question:

  1. Create a new column in panda_data called zero and give it a value of 0?

Arrange, Relocate, and Rename

It is common to receive data in ways that we want to change or alter. For instance, we might want to change the order of the columns, re-order the rows based on particular column values, or we may want to change the column headings. Mutate() includes options to change the order of columns that you create (see .before and .after in the help for the function). Relocate() can be used move columns around using the .before and .after commands, arrange() can be used to order the rows, and rename() can be used to give new names to column headings. By default, arrange will arrange the selecting column in ascending order. We can use desc() to arrange the column in descending order.

Artwork by Allison Horst
Artwork by Allison Horst


# Arrange the panda_data by panda_name in descending order
# Place the sex column after the ID column
order_data_ex1 <- panda_data %>% 
    arrange(desc(panda_name)) %>% 
  relocate(sex, .after = ID)
# Print
head(order_data_ex1)
##      panda_name      ID sex age weight_kg base genetic_value1 genetic_value2
## 1       wei_wei 4424967   f   7        98   CD             60             NA
## 2     tian_tian 4424490   m   4        89   WL             95             70
## 3 shuang_shuang 4424657   m   8       110   WL             92             NA
## 4        qiuqiu 4423758   f   9       130   CD             81             50
## 5       mao_mao 4415522   f   4       120   CD             70             30
## 6     lang_lang 4427758   m  10       110   WL             60             NA
##   genetic_value3 year
## 1             92 2017
## 2             87 2017
## 3             50 2017
## 4             30 2018
## 5             90 2017
## 6             70 2018
# Arrange the panda_data by genetic_value 1 and genetic_value2 after select the columns that start with 'genetic'
# Rename the genetic_value1 column to gen_val1
order_data_ex2 <- panda_data %>% 
    select(starts_with("genetic")) %>%
    arrange(genetic_value1, genetic_value2) %>% 
  rename(gen_val1 = genetic_value1)
# Print
head(order_data_ex2)
##   gen_val1 genetic_value2 genetic_value3
## 1       60             NA             92
## 2       60             NA             70
## 3       70             30             90
## 4       75             60             76
## 5       80             NA             80
## 6       80             NA             65

Question:

  1. Arrange the panda_data by age and then by panda weight, putting the year column before age and the base column after the last column? Rename the weight column to panda_wgt_kg.

Summarize

Summary statistics are easily calculated using the summarise() function (note: summarize() will also work). The summary statistics will be calculated among all rows in the dataframe, unless specified otherwise. Summarize() is often used in combination with group_by() (see below for more information). Various useful summary functions are provided in the help documentation for the function.

# Calculate the mean weight of the pandas and provide a count.
sum_table <- panda_data %>% 
  summarise(mean_wgt = mean(weight_kg),
            n = n())
# Print summary table
sum_table
##   mean_wgt  n
## 1    108.2 10
# Calculate the avg weight and minimum weight.  Don't create a new object.
panda_data %>% 
  summarise(avg_wt = mean(weight_kg), 
            min_wt = min(weight_kg))
##   avg_wt min_wt
## 1  108.2     89

Grouping Functions

Group_by() can by used to divide data rows into groups based on grouping column(s) that we specify. This function is often used in combination with others which define what you do with them after placing them in groups. When group_by() and summarise() are used together, you are essentially telling R to separate rows into different groups, and for each groups you use summarise() to generate a series of summary statistics that characterize the column values.

# Similar to above, calculate average and minimum weight, but summarize based on 'base' column.  Include a count of each group.
panda_data %>%
  group_by(base) %>%
  summarise(avg_wt = mean(weight_kg),
            min_wt = min(weight_kg),
            n = n())
## # A tibble: 2 × 4
##   base  avg_wt min_wt     n
##   <chr>  <dbl>  <int> <int>
## 1 CD      112      98     4
## 2 WL      106.     89     6
# Group summaries can also be calculated across muultiple groups.
# Here, we calculate the same as above, but based on base and sex
panda_data %>%
  group_by(base, sex) %>%
  summarise(avg_wt = mean(weight_kg),
            min_wt= min(weight_kg),
            n = n())
## # A tibble: 4 × 5
## # Groups:   base [2]
##   base  sex   avg_wt min_wt     n
##   <chr> <chr>  <dbl>  <int> <int>
## 1 CD    f       116      98     3
## 2 CD    m       100     100     1
## 3 WL    f       108.     95     2
## 4 WL    m       105.     89     4

Joining Tables

The last of the important tools we will cover in dplyr is how to join tables together. This is common practice in ecological statistics, with data in one table that are required to be appended to another table for analyses. To join these tables, we need to link the tables based on shared columns.

As an example, we will import a table (panda_data_med.csv) located in our Data folder that summarizes the vaccination history of each our pandas. Similar to other functions, we have multiple options to join the tables together. The most common join to use is a left_join(). In this case, all the records in our first table (i.e., the left table) are linked with those that match or are shared with the second data table (i.e., the right table). The means that we will keep all the records in the left table and include only those records that match in the right table (they will be returned NA if no match exists).

Other join options, including inner_join(), right_join, and full_join() also exist. Read the help file to determine which join is most appropriate to accomplish the activity of interest.

Note: Sometimes the column headings between tables do not match, even if the data within the column does (ID in table 1 != id in table 2). In these cases, you can either rename() the column headings to make them match or specify columns you want to join.

# Read in the vaccination table
panda_med <- read.csv(file="Data/panda_data_med.csv")

# Join all the rows in table 1 (panda_data) with table 2 (panda_med) to determine which of the pandas were vaccinated.  Since ID exists in both tables, this is a straightforward join.
# Arrange the result by ID and year vaccinated
panda_join_ex1 <- panda_data %>% 
  left_join(panda_med, by = "ID") %>% 
  arrange(ID, year_vaccination)
# Print
head(panda_join_ex1)
##   panda_name.x      ID age weight_kg sex base genetic_value1 genetic_value2
## 1        da_da 4415463   5       100   m   CD             99             50
## 2      mao_mao 4415522   4       120   f   CD             70             30
## 3      mao_mao 4415522   4       120   f   CD             70             30
## 4      lan_lan 4416073   7        95   f   WL             80             NA
## 5      bei_bei 4416405   5       120   f   WL             80             NA
## 6      bao_bao 4417779   5       110   m   WL             75             60
##   genetic_value3 year panda_name.y year_vaccination     vaccine_type
## 1             60 2017        da_da             2018 canine distemper
## 2             90 2017      mao_mao             2017 canine distemper
## 3             90 2017      mao_mao             2018           rabies
## 4             80 2017         <NA>               NA             <NA>
## 5             65 2017         <NA>               NA             <NA>
## 6             76 2017         <NA>               NA             <NA>
# Example of how to join based on multiple fields and fields that don't exactly match
panda_join_ex2 <- panda_data %>%
  left_join(panda_med,
            by = c("ID" = "ID", 
                   "year"= "year_vaccination")) 
# Print
head(panda_join_ex2)
##   panda_name.x      ID age weight_kg sex base genetic_value1 genetic_value2
## 1        da_da 4415463   5       100   m   CD             99             50
## 2      mao_mao 4415522   4       120   f   CD             70             30
## 3      lan_lan 4416073   7        95   f   WL             80             NA
## 4      bei_bei 4416405   5       120   f   WL             80             NA
## 5      bao_bao 4417779   5       110   m   WL             75             60
## 6    tian_tian 4424490   4        89   m   WL             95             70
##   genetic_value3 year panda_name.y     vaccine_type
## 1             60 2017         <NA>             <NA>
## 2             90 2017      mao_mao canine distemper
## 3             80 2017         <NA>             <NA>
## 4             65 2017         <NA>             <NA>
## 5             76 2017         <NA>             <NA>
## 6             87 2017         <NA>             <NA>

Excercise

Text here